I am looking forward to the course for two reasons:
Another initial impression is that the Moodle page is rather verbose and chaotic, which hampers finding the necessary information and the assigned tasks. I hope that I will get used to this and will not miss any compulsory assignment.
I found the course in Sisu because I still needed one credit for my Transferable Skills section.
I've found the book interesting and well written. Naturally, due to my previous experience in R, I was only skim-reading it. Nevertheless I found some useful functionalities that I hadn't used before, e.g. combining filter conditions:

filter(year == 2020 | disease == "COVID19")

gapdata2007 %>%
ggplot(aes(x = continent, y = lifeExp, fill = country)) +
geom_col(position = "dodge") + theme(legend.position = "none")

The percent() function in the tidyverse is much simpler than sprintf("%2.1f%%"). With plotly and the HTML output of R Markdown one can create interactive pages (e.g. one can hover over the diagram below to highlight countries):

plot = gapdata2007 %>%
ggplot(aes(x = gdpPercap, y = lifeExp, color = continent, label = country)) +
geom_point()
ggplotly(plot)
Describe the work you have done this week and summarize your learning.
We got data on questionnaire responses from individuals together with their results on some sort of test (exam). The questionnaire seems to refer to study habits. The origin of the data cannot be clearly identified from the metadata provided.
# read the raw survey data
df = as_tibble(read.csv('http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt', sep = '\t'))
# read the wrangled data and harmonize the column names
learning2014 = read_csv('data/learning2014.csv') %>%
  rename(points = Points, attitude = Attitude)
The original raw data consists of 183 records and 60 variables. After summarizing related questions into groups (for example, questions referring to in-depth studying into a "deep" group) by taking their mean, and removing observations with 0 exam points, the processed data consists of 166 observations and 7 variables. Information on the data structure can be found below.
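The summarising step described above can be sketched like this (the column names here are made up for illustration; the actual question codes come from the survey metadata):

```r
# toy sketch: average a group of related survey questions into one score
# (hypothetical question columns; not the real JYTOPKYS3 codes)
answers <- data.frame(D03 = c(4, 2), D11 = c(5, 3), D19 = c(3, 1))
deep <- rowMeans(answers)  # one "deep" score per respondent
deep
```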
str(learning2014)
## spec_tbl_df [166 x 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ gender : chr [1:166] "F" "M" "F" "M" ...
## $ Age : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
## $ points : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
## - attr(*, "spec")=
## .. cols(
## .. gender = col_character(),
## .. Age = col_double(),
## .. Attitude = col_double(),
## .. deep = col_double(),
## .. stra = col_double(),
## .. surf = col_double(),
## .. Points = col_double()
## .. )
Let's have a look at the distribution of the variables in this data by plotting them pairwise in a scatter-plot matrix. We suspect that there are differences between the male and female participants, hence we stratify the data into these two groups by color.
# colors for the two gender groups (note the "#" prefix required for hex colors)
colMap = c("F" = "#FF5555", "M" = "#1100EE")
# create a more advanced plot matrix with ggpairs()
ggpairs(learning2014, mapping = aes(color = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)),
upper = list(continuous = wrap("cor")) ) +
scale_color_manual(values = colMap) +
scale_fill_manual(values = colMap)
We can read several pieces of information from the plot. First, there were almost 1.5 times more females (red) in the study than males (blue), as can be seen from the bar chart at the top right. Based on the boxplots in the top row, the males' attitude towards statistics was higher; however, there was no huge difference in exam points between males and females. Nevertheless, the positive correlation between attitude and points is observable for both males and females.
In linear regression analysis we search for a linear relationship between predictor (feature) variables and a target outcome. Let's pick three predictors that are the most likely candidates based on the previous plot:
my_model <- lm(points ~ attitude + stra + surf, data = learning2014)
# print out a summary of the model
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
The results show that attitude is significantly useful for predicting the exam points, as the p-value for the slope of this variable is very low (\(1.93 \cdot 10^{-8}\)). The F-statistic of the test also indicates that at least one of the selected variables has a non-zero slope.
The t-test for each variable has the null hypothesis that the slope of that variable in the regression equation is 0. If that is true, then there is no association between the variable and the target. If the p-value is low, then the result is unlikely under the null hypothesis, so we usually reject it, i.e. we believe that there is an association.
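To make this concrete, here is a small simulated sketch (toy data, not the course dataset): when the true slope is zero, the slope's p-value is typically large, and when the slope is clearly non-zero, the p-value collapses.

```r
set.seed(1)
x <- rnorm(100)

# true slope is 0: y does not depend on x
y_null <- 5 + rnorm(100)
p_null <- summary(lm(y_null ~ x))$coefficients["x", "Pr(>|t|)"]

# true slope is 2: y clearly depends on x
y_alt <- 5 + 2 * x + rnorm(100)
p_alt <- summary(lm(y_alt ~ x))$coefficients["x", "Pr(>|t|)"]

p_alt  # tiny p-value: we reject the null hypothesis of zero slope
```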
So let's fit the model again with attitude only, as that was the only variable showing significance:
my_model <- lm(points ~ attitude, data = learning2014)
# print out a summary of the model
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The summary of this model shows that there is a positive association between the attitude score and exam points: a one-unit increase in attitude corresponds to roughly a 3.5-point increase in the exam result. In addition there is an extrapolated baseline (intercept) of 11.6 points. The multiple R-squared value gives the proportion of variation that is explained by the variables in the model. In this case it is 0.1906, so a little less than 20% of the variation is explained by the explanatory variable, meaning that there are other significant factors affecting the exam results that we are omitting here.
par(mfrow = c(2,2))
plot(my_model, which = c(1,2,5))
The linear regression model assumes that the observations are produced by a linear function of the explanatory variables plus additive, normally distributed noise with zero mean. The residuals represent this noise, hence it is good to check how normal they are and whether they are independent of the fitted values (otherwise the noise would depend on the observation). The first two plots refer to this. In the first one we can see that the residuals are evenly distributed over the fitted values. The Q-Q plot shows that the quantiles of the residual distribution are very close to those of a normal distribution. The last plot, "Residuals vs. Leverage", indicates which observations may have the greatest effect on the fitted parameters (linear regression is known to be sensitive to outliers), hence it could be useful to check whether the model improves a lot when those observations are removed.
Beyond explaining the data, linear regression models can also be used to predict unobserved outcomes, as below.
new_attitudes <- c("Mia" = 3.8, "Mike"= 4.4, "Riikka" = 2.2, "Pekka" = 2.9)
new_data <- data.frame(attitude = new_attitudes)
# Print out the new data
new_data
## attitude
## Mia 3.8
## Mike 4.4
## Riikka 2.2
## Pekka 2.9
# Predict the new students exam points based on attitude
predict(my_model, newdata = new_data)
## Mia Mike Riikka Pekka
## 25.03390 27.14918 19.39317 21.86099
As one can see, the predicted exam points increase with attitude: Mike, with the highest attitude score, gets the highest prediction.
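These predictions are nothing more than intercept + slope × attitude. As a sanity check, using the (rounded) coefficients reported in the model summary above:

```r
intercept <- 11.6372  # from the model summary above
slope <- 3.5255

# Mia's attitude score was 3.8
intercept + slope * 3.8  # ~25.03, matching predict() up to rounding
```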
We got data on student performance at school, including grades, the number of previous class failures, and related background information such as family status, alcohol consumption, etc. We had information from two classes: Math and Portuguese. To create a joint dataset we round-averaged the grades from these two classes and included only students that were present in both (at least based on their attributes; the table does not contain unique student identifiers, which may cause confusion).
# Double check the data wrangling result
df_dc = as_tibble(read.csv('https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv', sep = ','))
df = read_csv('data/student_joint.csv')
#full_join(df,df_dc)
# OK: this is of the same size, i.e. the tables agree.
After these operations there are 370 students with 35 variables in the data.
str(df)
## spec_tbl_df [370 x 35] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ school : chr [1:370] "GP" "GP" "GP" "GP" ...
## $ sex : chr [1:370] "F" "F" "F" "F" ...
## $ age : num [1:370] 15 15 15 15 15 15 15 15 15 15 ...
## $ address : chr [1:370] "R" "R" "R" "R" ...
## $ famsize : chr [1:370] "GT3" "GT3" "GT3" "GT3" ...
## $ Pstatus : chr [1:370] "T" "T" "T" "T" ...
## $ Medu : num [1:370] 1 1 2 2 3 3 3 2 3 3 ...
## $ Fedu : num [1:370] 1 1 2 4 3 4 4 2 1 3 ...
## $ Mjob : chr [1:370] "at_home" "other" "at_home" "services" ...
## $ Fjob : chr [1:370] "other" "other" "other" "health" ...
## $ reason : chr [1:370] "home" "reputation" "reputation" "course" ...
## $ guardian : chr [1:370] "mother" "mother" "mother" "mother" ...
## $ traveltime: num [1:370] 2 1 1 1 2 1 2 2 2 1 ...
## $ studytime : num [1:370] 4 2 1 3 3 3 3 2 4 4 ...
## $ schoolsup : chr [1:370] "yes" "yes" "yes" "yes" ...
## $ famsup : chr [1:370] "yes" "yes" "yes" "yes" ...
## $ activities: chr [1:370] "yes" "no" "yes" "yes" ...
## $ nursery : chr [1:370] "yes" "no" "yes" "yes" ...
## $ higher : chr [1:370] "yes" "yes" "yes" "yes" ...
## $ internet : chr [1:370] "yes" "yes" "no" "yes" ...
## $ romantic : chr [1:370] "no" "yes" "no" "no" ...
## $ famrel : num [1:370] 3 3 4 4 4 4 4 4 4 4 ...
## $ freetime : num [1:370] 1 3 3 3 2 3 2 1 4 3 ...
## $ goout : num [1:370] 2 4 1 2 1 2 2 3 2 3 ...
## $ Dalc : num [1:370] 1 2 1 1 2 1 2 1 2 1 ...
## $ Walc : num [1:370] 1 4 1 1 3 1 2 3 3 1 ...
## $ health : num [1:370] 1 5 2 5 3 5 5 4 3 4 ...
## $ failures : num [1:370] 0 1 0 0 1 0 1 0 0 0 ...
## $ absences : num [1:370] 3 2 8 2 5 2 0 1 9 10 ...
## $ G1 : num [1:370] 10 10 14 10 12 12 11 10 16 10 ...
## $ G2 : num [1:370] 12 8 13 10 12 12 6 10 16 10 ...
## $ G3 : num [1:370] 12 8 12 9 12 12 6 10 16 10 ...
## $ paid : chr [1:370] "yes" "no" "yes" "yes" ...
## $ alc_use : num [1:370] 1 3 1 1 2.5 1 2 2 2.5 1 ...
## $ high_use : logi [1:370] FALSE TRUE FALSE FALSE TRUE FALSE ...
## - attr(*, "spec")=
## .. cols(
## .. school = col_character(),
## .. sex = col_character(),
## .. age = col_double(),
## .. address = col_character(),
## .. famsize = col_character(),
## .. Pstatus = col_character(),
## .. Medu = col_double(),
## .. Fedu = col_double(),
## .. Mjob = col_character(),
## .. Fjob = col_character(),
## .. reason = col_character(),
## .. guardian = col_character(),
## .. traveltime = col_double(),
## .. studytime = col_double(),
## .. schoolsup = col_character(),
## .. famsup = col_character(),
## .. activities = col_character(),
## .. nursery = col_character(),
## .. higher = col_character(),
## .. internet = col_character(),
## .. romantic = col_character(),
## .. famrel = col_double(),
## .. freetime = col_double(),
## .. goout = col_double(),
## .. Dalc = col_double(),
## .. Walc = col_double(),
## .. health = col_double(),
## .. failures = col_double(),
## .. absences = col_double(),
## .. G1 = col_double(),
## .. G2 = col_double(),
## .. G3 = col_double(),
## .. paid = col_character(),
## .. alc_use = col_double(),
## .. high_use = col_logical()
## .. )
I picked four variables to examine whether they correlate with high alcohol consumption. Below I explain my a-priori expectations.

- studytime: likely, the less someone studies, the more time is available for consuming alcohol. It is unlikely that someone would use alcohol to mitigate stress from too much studying, so I expect a negative correlation.
- activities: the more activities someone does, the less time they have for hanging out with bad student groups; I again expect a negative correlation.
- higher (meaning higher-education plans): students with clearer plans for their future are less likely to be heavy alcohol consumers.
- freetime: students with too much free time will probably get bored and consume more alcohol to make their life exciting.

p1 = df %>% ggplot() + aes(x = high_use, y = studytime, fill = high_use) + geom_violin()
p2 = df %>% ggplot() + aes(x = activities, fill = high_use) + geom_bar(position = "fill")
p3 = df %>% ggplot() + aes(x = higher, fill = high_use) + geom_bar(position = "fill")
p4 = df %>% ggplot() + aes(x = high_use, y = freetime, fill = high_use) + geom_violin()
grid.arrange(p1,p2,p3,p4)
m = glm(high_use ~ studytime + activities + higher + freetime, data = df, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ studytime + activities + higher + freetime,
## family = "binomial", data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6471 -0.8668 -0.6585 1.1902 2.1362
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.0841 0.7071 -0.119 0.905327
## studytime -0.5273 0.1577 -3.344 0.000826 ***
## activitiesyes -0.2283 0.2400 -0.951 0.341541
## higheryes -0.7545 0.5398 -1.398 0.162162
## freetime 0.3340 0.1246 2.680 0.007359 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 423.64 on 365 degrees of freedom
## AIC: 433.64
##
## Number of Fisher Scoring iterations: 4
The model suggests that the studytime and freetime variables are statistically significant for predicting high/low alcohol usage. It is interesting that the higher-education variable is not significant, but I double-checked the data and it is severely skewed towards students planning higher education (16 students not planning versus 354 planning), so there is little power to detect an effect of this variable on high alcohol consumption. The categorical variables (activities and higher) were internally converted to "dummy" indicator levels; as there were only two levels in both cases, only one dummy variable was created for each.
coef(m)
## (Intercept) studytime activitiesyes higheryes freetime
## -0.08410415 -0.52727634 -0.22825310 -0.75452928 0.33399121
exp(coef(m))
## (Intercept) studytime activitiesyes higheryes freetime
## 0.9193355 0.5902103 0.7959228 0.4702319 1.3965309
The coefficients represent the association between each variable and the outcome on the log-odds scale: a negative coefficient means that a higher value (e.g. of studytime) results in a lower probability of high alcohol usage. In exponentiated form these are odds ratios for a one-unit change, which quantify each variable's influence on the prediction. Values close to 1 mean little effect; deviations in either direction show that the variable matters for the prediction.
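As an illustration of how a log-odds coefficient translates to odds and probabilities (using the rounded studytime coefficient from the summary above; the other predictors are ignored here, so this is not the full model):

```r
b0 <- -0.0841   # intercept, log-odds scale (rounded, from the summary above)
b1 <- -0.5273   # studytime coefficient

exp(b1)  # ~0.59: each extra unit of study time multiplies the odds by ~0.59

# implied probability of high use at low vs high study time
plogis(b0 + b1 * 1)  # studytime = 1
plogis(b0 + b1 * 4)  # studytime = 4, clearly lower probability
```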
confint(m)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -1.47694251 1.3160333
## studytime -0.84575068 -0.2261421
## activitiesyes -0.70031246 0.2420972
## higheryes -1.84750298 0.3028107
## freetime 0.09312641 0.5827670
These are 95% confidence intervals. It is usually understood that if an interval does not contain 0, the result is significant, and indeed this matches the significance table of the model. The results match my hypotheses, except for the higher-education variable, but that is explained by the skew discussed above.
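These are profile-likelihood intervals; a quick Wald approximation, estimate ± 1.96 × standard error from the coefficient table above, gives nearly the same range for studytime:

```r
b  <- -0.5273   # studytime estimate from the model summary
se <-  0.1577   # its standard error

b + c(-1, 1) * qnorm(0.975) * se  # ~(-0.84, -0.22), close to confint()'s (-0.85, -0.23)
```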
# Use only the significant variables
m2 = glm(high_use ~ studytime + freetime, data = df, family = "binomial")
probabilities <- predict(m2, type = "response")
df %<>% mutate(probability = probabilities)
df %<>% mutate(prediction = if_else(probability>0.5,TRUE,FALSE))
# tabulate the target variable versus the predictions
tt = table(high_use = df$high_use, prediction = df$prediction)
tt
## prediction
## high_use FALSE TRUE
## FALSE 250 9
## TRUE 102 9
This table is also called the confusion matrix. The usual metrics calculated from it are precision (TP/(TP+FP)), here 0.5, which is pretty bad, and recall (TP/(TP+FN)), here about 0.081, i.e. almost 0. Accuracy (correct predictions / total predictions) was a bit better: 0.7. The training error is naturally 1 − accuracy, i.e. 0.3. This model performs pretty badly at predicting the high alcohol users, but pretty well at identifying the non-high consumers. Simple random guessing (without any other knowledge) should give an accuracy around 0.5, hence this model is still somewhat better than a random guess.
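The metrics quoted above can be verified directly from the 2×2 table (cell counts copied from the confusion matrix):

```r
TN <- 250; FP <- 9   # true high_use = FALSE
FN <- 102; TP <- 9   # true high_use = TRUE

precision <- TP / (TP + FP)                    # 0.5
recall    <- TP / (TP + FN)                    # ~0.081
accuracy  <- (TP + TN) / (TP + TN + FP + FN)   # 0.7

c(precision = precision, recall = recall, accuracy = accuracy)
```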
# load the data
data("Boston")
df = as_tibble(Boston)
In this exercise we are working with a dataset that is readily available in the MASS (Modern Applied Statistics with S) package of R and contains socio-economic features of Boston suburbs. More info on the data can be found at: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html. The raw data consists of 506 records that refer to different suburbs and 14 variables including for instance the crime rate of the town, the average age of buildings etc.
str(df)
## tibble [506 x 14] (S3: tbl_df/tbl/data.frame)
## $ crim : num [1:506] 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num [1:506] 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num [1:506] 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int [1:506] 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num [1:506] 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num [1:506] 6.58 6.42 7.18 7 7.15 ...
## $ age : num [1:506] 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num [1:506] 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int [1:506] 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num [1:506] 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num [1:506] 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num [1:506] 397 397 393 395 397 ...
## $ lstat : num [1:506] 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num [1:506] 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Let’s have a look at the distribution of variables in this data by plotting them pairwise on a scatter plot.
p = ggpairs(df, lower = list(combo = wrap("facethist", bins = 20)),
upper = list(continuous = wrap("cor", size = 2)) ) +
theme(text = element_text(size = 8), axis.text.x = element_text(angle = 45))
p
# It's a bit hard to see with ggpairs in this case, though it is more informative than the simple corrplot.
cor_matrix <- cor(df)
corrplot(cor_matrix, method="circle")
The variables follow various distributions. For instance the crim variable, which encodes the crime rate of the town, seems to follow some sort of power-law distribution: a few suburbs have very high crime rates while the others are rather low. On the other hand, the rm variable, which encodes the number of rooms, clearly has a normal distribution centred around 6 rooms per dwelling (which feels a bit high, though). Again, indus and tax, which encode the "proportion of non-retail business" (i.e. how industrialized a town is) and the "full-value property tax rate" respectively, seem to have bimodal distributions, suggesting for instance that the industrialized suburbs separate from the living quarters.
One of the strongest correlations is related to this industrial variable: it is negatively correlated with the dis variable, which measures the "weighted mean of distances to five Boston employment centres". This makes sense: the industrialized areas (likely including white-collar businesses) are employment centres, so the less industrialized a district is, the further away it is from these centres. The nitrogen oxide concentration is also strongly negatively correlated with this distance, meaning that areas close to the employment centres have high NOx concentrations, suggesting that pollution in Boston concentrates around these industrial hubs.
As all of these variables are continuous, let's scale them to have mean 0 and standard deviation 1:
\[\mathrm{scaled}(x) = \frac{x - \mathrm{mean}(x)}{\mathrm{sd}(x)}\]
df_scaled = as_tibble(scale(df))
summary(df_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
As expected, the means became 0 (and the standard deviations 1).
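A quick base-R check on a toy vector that scale() indeed applies the formula above:

```r
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
auto <- as.vector(scale(x))

all.equal(manual, auto)  # TRUE: scale() is exactly (x - mean) / sd
c(mean = mean(manual), sd = sd(manual))  # mean 0, sd 1
```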
bins = quantile(df_scaled$crim)
# create a categorical variable 'crime'
df_scaled %<>% mutate(crim = cut(crim, breaks = bins, include.lowest = TRUE))
table(df_scaled$crim)
##
## [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739] (0.00739,9.92]
## 127 126 126 127
# The uniform distribution makes sense, as we divided by quantiles.
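The near-uniform bin sizes follow directly from cutting at the quantiles; a toy sketch:

```r
set.seed(42)
v <- rnorm(400)
qbins <- quantile(v)  # min, Q1, median, Q3, max
groups <- cut(v, breaks = qbins, include.lowest = TRUE)
table(groups)  # four groups of ~100 each
```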
n = nrow(df_scaled)
ind = sample(n, size = n * 0.8)
train = df_scaled[ind,]
test = df_scaled[-ind,]
correct_classes = test$crim
test %<>% dplyr::select(-crim)
lda.fit <- lda(crim ~ ., data = train)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crim)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
The LDA suggests that the “access to radial highways” rad is positively correlated with the highest crime-rate group.
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739] (0.00739,9.92]
## [-0.419,-0.411] 15 10 0 0
## (-0.411,-0.39] 7 16 2 0
## (-0.39,0.00739] 1 6 19 2
## (0.00739,9.92] 0 0 1 23
The confusion matrix shows that this model is pretty good at predicting the correct crime rate: the highest values are found on the diagonal. It is also visible that the highest crime-rate group is the easiest to predict (blue on the LDA plot), as the LDA transformation has separated that group best from the others. The three lower crime-rate groups are more mixed, hence more errors are made in their prediction.
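The overall accuracy can be read off the diagonal of this cross-tabulation (counts copied from above):

```r
cm <- matrix(c(15, 10,  0,  0,
                7, 16,  2,  0,
                1,  6, 19,  2,
                0,  0,  1, 23),
             nrow = 4, byrow = TRUE)

sum(diag(cm)) / sum(cm)  # ~0.72 of the test observations classified correctly
```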
# No need to reload the data; just scale it again
df_scaled = as_tibble(scale(as_tibble(Boston)))
dist_eu = dist(df_scaled)
# I'm a bit confused: I don't think this is needed for the k-means function (the assignment requires calculating it), as k-means computes the distances on its own.
km = kmeans(df_scaled, centers = 4)
pairs(df_scaled, col = km$cluster)
Not much is visible in these figures. Unfortunately, I mostly see two clusters along the black variable, suggesting segregation in this area.
set.seed(123)
# Let's not overdo it, k-means is a fairly expensive technique
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(df_scaled, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
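The same elbow heuristic on a tiny synthetic two-cluster dataset (base R only, no plotting) shows the characteristic drop:

```r
set.seed(123)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2))

twcss_toy <- sapply(1:5, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)
twcss_toy  # the big drop from k = 1 to k = 2 marks the elbow
```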
km <- kmeans(df_scaled, centers = 2)
p = ggpairs(df_scaled %>% mutate(cluster = as_factor(km$cluster)), mapping = aes(color = cluster, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)),
upper = list(continuous = wrap("cor", size = 2)) )
p
## Warning in cor(x, y): the standard deviation is zero
Indeed, based on the dropping WCSS (within-cluster sum of squares, a measure of coherence within a cluster; but note: clustering is by definition dubious!), two clusters is a good number in this data, and it reveals segregated societal groups (districts) with clear separation in crime rates and in the lower status of the population (lstat). It's interesting that the nitrogen oxide pollution is actually better for the group with lower status. The lower-status group obviously pays less tax (likely connected to lower income) and lives in older houses. Most interesting is the distribution of the average number of rooms: the joint normal distribution is composed of two somewhat heavy-tailed distributions pointing in opposite directions, the lower-status group with fewer rooms and the upper-status group with more rooms.
I personally like PCA, so let's visualize the result with that as well (even though this is not listed as an extra-point exercise):
pco = prcomp(df_scaled)
pco$x %>% as_tibble() %>% mutate(cluster = as_factor(km$cluster)) %>% ggplot() +
aes(x = PC1, y = PC2, color = cluster) +
geom_point()
And indeed: PC1 picks up exactly the same difference that was achieved with clustering, but it also suggests that the separation is not binary in the full data (there are features that mix these societies together).
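To see why PC1 tends to pick up a dominant group separation, here is a minimal base-R sketch with an artificial two-group dataset:

```r
set.seed(1)
grp <- rep(c(0, 5), each = 50)  # two well-separated groups
toy <- cbind(grp + rnorm(100), grp + rnorm(100), rnorm(100))
pc <- prcomp(scale(toy))

# the PC1 scores of the two groups sit far apart
c(mean(pc$x[1:50, 1]), mean(pc$x[51:100, 1]))
```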